{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_3b_pruning_llama_optipfair.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DT0FAsUnncFY"
      },
      "source": [
        "<div>\n",
        "    <h1>Large Language Models Projects</a></h1>\n",
        "    <h3>Apply and Implement Strategies for Large Language Models</h3>\n",
        "    <h2>Pruning Llama 3.2. with OptiPFair</h2>\n",
        "    <h3>Using OptiPFair to prune MLP Layers with GLU structure.</h3>\n",
        "</div>\n",
        "\n",
        "by [Pere Martra](https://www.linkedin.com/in/pere-martra/)\n",
        "\n",
        "________\n",
        "Models: meta-llama/Llama-3.2-1B\n",
        "\n",
        "Colab Environment: GPU T4.\n",
        "\n",
        "Keys:\n",
        "* Pruning\n",
        "* Structured pruning\n",
        "* optiPfair\n",
        "\n",
        "_______\n",
        "**disclaimer: The pruning section was created after the first edition of the book was published. They are not included in the book’s original content but are intended to supplement and expand on the topics covered.**\n",
        "\n",
        "This is the unofficial repository for the book:\n",
        "        <a href=\"https://amzn.to/4eanT1g\"> <b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).\n",
        "        The book is based on the content of this repository, but the notebooks are being updated, and I am incorporating new examples and chapters.\n",
        "        If you are looking for the official repository for the book, with the original notebooks, you should visit the\n",
        "        <a href=\"https://github.com/Apress/Large-Language-Models-Projects\">Apress repository</a>, where you can find all the notebooks in their original format as they appear in the book.\n",
        "\n",
        "This notebook serves as a demonstration code for the paper [Exploring GLU Expansion Ratios: Structured Pruning in Llama-3.2 Models.](https://doi.org/10.31219/osf.io/qgxea)\n",
        "\n",
        "The paper studies how the % of expansion produced in the GLU layers influences performance and consumption. For this purpose, seven different models have been generated from the Llama-3.2-1B and Llama-3.2-3B base models, reaching the conclusion that the optimal balance is achieved with an expansion of 140%.\n",
        "______"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iEwxZCVsoIau"
      },
      "source": [
        "# Introduction\n",
        "This notebook cotinues the work done at: [6_3_pruning_structured_llama3.2-1b_OK.ipynb](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_3_pruning_structured_llama3.2-1b_OK.ipynb) the pruning process was done manually, and you can find the implementation code there. In this notebook, we use the [OptiPFair](https://github.com/peremartra/optipfair) library, developed by myself, which simplifies the pruning process for LLMs.\n",
        "\n",
        "En este notebook nos focalizamo en explicar el funcionamiento de la libreria OptiPFair y sus diversas opciones para realizar el pruning de las capas MLP de modelos con estructura GLU: Llama, Gemma, QWen, Mistral y otros.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eQIxAOPZtPBN"
      },
      "source": [
        "#Install libraries & Configure variables."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "5zHApVm41HWq",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "f542547a-13b6-418d-cf7f-18b1217d7603"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m363.4/363.4 MB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.8/13.8 MB\u001b[0m \u001b[31m124.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m24.6/24.6 MB\u001b[0m \u001b[31m97.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m883.7/883.7 kB\u001b[0m \u001b[31m65.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m664.8/664.8 MB\u001b[0m \u001b[31m2.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m211.5/211.5 MB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.3/56.3 MB\u001b[0m \u001b[31m17.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m127.9/127.9 MB\u001b[0m \u001b[31m7.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.5/207.5 MB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.1/21.1 MB\u001b[0m \u001b[31m107.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h"
          ]
        }
      ],
      "source": [
        "!pip install -q transformers\n",
        "!pip install -q optipfair"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "pip show optipfair"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "eqSfFAukLZ_J",
        "outputId": "913ee4e9-8fbf-4424-caf8-fb14c97cf724"
      },
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Name: optipfair\n",
            "Version: 0.1.3\n",
            "Summary: A library for structured pruning & Bias visualization of large language models\n",
            "Home-page: https://github.com/peremartra/optipfair\n",
            "Author: Pere Martra\n",
            "Author-email: peremartra@uadla.com\n",
            "License: \n",
            "Location: /usr/local/lib/python3.11/dist-packages\n",
            "Requires: click, torch, tqdm, transformers\n",
            "Required-by: \n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "GJNgRj4M187E"
      },
      "outputs": [],
      "source": [
        "import torch\n",
        "import os\n",
        "\n",
        "from tqdm import tqdm\n",
        "from optipfair import prune_model\n",
        "from transformers import AutoModelForCausalLM, AutoTokenizer"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "tbIyUlXEtbqs",
        "outputId": "c707b2da-369c-4522-f3ea-bf61d190516a"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Using device: cuda\n"
          ]
        }
      ],
      "source": [
        "# Check if GPU is available\n",
        "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
        "print(f\"Using device: {device}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sM-QwxyKw-YG"
      },
      "source": [
        "#Download model and explore structure"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "q-z_1Zpg2I6u"
      },
      "outputs": [],
      "source": [
        "model_name = 'meta-llama/Llama-3.2-1B'\n",
        "model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
        "#tokenizer.pad_token = tokenizer.eos_token  # Set pad token"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "id": "9UpMD4Hw2MWg"
      },
      "outputs": [],
      "source": [
        "def get_output(prompt, model=model, tokenizer=tokenizer):\n",
        "    inputs = tokenizer(prompt, return_tensors='pt').to(device)\n",
        "    outputs = model.generate(\n",
        "        inputs['input_ids'],\n",
        "        attention_mask=inputs['attention_mask'],\n",
        "        max_length=50,\n",
        "        num_return_sequences=1,\n",
        "        pad_token_id=tokenizer.pad_token_id,\n",
        "        temperature=None,\n",
        "        top_p=None,\n",
        "        do_sample=False,          # Disable sampling\n",
        "        num_beams=5,              # Use beam search\n",
        "        early_stopping=True,      # Stop when end-of-sequence token is generated\n",
        "        no_repeat_ngram_size=2    # Prevent repetition of 2-grams\n",
        "    )\n",
        "    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
        "    return generated"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4muyx_8M5OAu"
      },
      "source": [
        "## studying the model structure\n",
        "As demonstrated in the [previous notebook](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6_2_pruning_structured_llama3.2-1b_KO.ipynb), studying the structure of the model that will undergo pruning is crucial.\n",
        "\n",
        "In this notebook, we’re going to fine-tune the pruning process for the Llama3.2 model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "y5Hs4oQ4B7Z0",
        "outputId": "d2b81330-b384-452d-934e-f2e0cd096871"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "LlamaForCausalLM(\n",
            "  (model): LlamaModel(\n",
            "    (embed_tokens): Embedding(128256, 2048)\n",
            "    (layers): ModuleList(\n",
            "      (0-15): 16 x LlamaDecoderLayer(\n",
            "        (self_attn): LlamaAttention(\n",
            "          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)\n",
            "          (k_proj): Linear(in_features=2048, out_features=512, bias=False)\n",
            "          (v_proj): Linear(in_features=2048, out_features=512, bias=False)\n",
            "          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)\n",
            "        )\n",
            "        (mlp): LlamaMLP(\n",
            "          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)\n",
            "          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)\n",
            "          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)\n",
            "          (act_fn): SiLU()\n",
            "        )\n",
            "        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)\n",
            "        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)\n",
            "      )\n",
            "    )\n",
            "    (norm): LlamaRMSNorm((2048,), eps=1e-05)\n",
            "    (rotary_emb): LlamaRotaryEmbedding()\n",
            "  )\n",
            "  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)\n",
            ")\n"
          ]
        }
      ],
      "source": [
        "print(model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yPMslK3QCAb1"
      },
      "source": [
        "\n",
        "An MLP block typically consists of layers that scale the data to larger dimensions and others that return it to its original size.\n",
        "\n",
        "In the MLP block of the model, we find two projection layers: `gat_proj` and `down_proj`, both scaling from 2048 to 8192. The purpose of having two layers projecting to the same intermediate size might be related to gating mechanisms. A gating mechanism selectively controls information flow in neural networks by using learned weights to \"gate\" or filter inputs.\n",
        "\n",
        "However, to truly understand how these layers function, we’d need to refer to the model's documentation or even the source code. But, this structure usually indicates, at least, I haven't encountered a case where it doesn't, that the layers performing the upsizing work in pairs, and they cannot be treated as independent linear layers.\n",
        "\n",
        "In other words, any operation we apply to one layer must be replicated in the other. Most importantly, when identifying which neurons have more or less importance, we can't evaluate the neurons of a single layer in isolation; we need to treat them as pairs.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "alKH3QH64WFL",
        "outputId": "852b54cb-5537-4d86-a021-29a15ff83768"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Generated text: Paris is the capital of France and one of the most visited cities in the world. It is a city with a rich history and culture, as well as a vibrant and diverse population. Paris is home to many famous landmarks, including the Eiff\n"
          ]
        }
      ],
      "source": [
        "# Test the original model\n",
        "prompt = \"Paris is the capital of\"\n",
        "generated = get_output(prompt)\n",
        "print(f\"Generated text: {generated}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CK9NwmBWnkSP"
      },
      "source": [
        "#Pruning the Model.\n",
        "##Support pruning functions.\n",
        "###Compute neuron importance functions.\n",
        "\n",
        "To perform the pruning process, you need to call the `prune_model` function from the library (OptiPFair)[https://github.com/peremartra/optipfair].\n",
        "\n",
        "To use this function, you need to provide the following parameters:\n",
        "\n",
        "* model: The model to be pruned.\n",
        "* pruning_type: The type of pruning to apply. In this case, it will be \"MLP_GLU\", which is currently the only option supported by the library.\n",
        "* neuron_selection_method:\n",
        "  * MAW: Maximum Absolute Weight Values.\n",
        "  * VOW: Variance Of Weigths.\n",
        "  * PON: Product Of Norms.\n",
        "  With LLaMA models, the method that works best without requiring further training is MAW.\n",
        "* pruning_percentage o expansion_rate: you need to provide one of them. In this notebook, we’ll use pruning_percentage.\n",
        "* show_progress: By default, it's set to True. It displays the progress of the pruning process.\n",
        "* return_stats: By default, it's set to True. It returns the percentage of neurons removed and the resulting expansion rate.\n",
        "\n",
        "\n",
        "*I’m leaving the others in the notebook purely as an exercise.*\n",
        "\n",
        "The **MAW** method works better because it directly identifies the most influential neurons based on the magnitude of their connections. These neurons are likely responsible for key decisions, making the model more accurate after pruning. The Variance of Weights method, while useful in some contexts, can retain neurons that may not contribute significantly to the task, leading to less coherent model outputs.\n",
        "\n",
        "However, we shouldn’t fall into the trap of assuming that this neuron selection method will work best across all model structures. It works well with Llama models, and this may be due to several factors:\n",
        "\n",
        "* The relatively large projection from 2048 to 8192.\n",
        "* The use of a GLU structure.\n",
        "* The type of activation function used.\n",
        "\n",
        "So, if we use a model from another family, like Gemma or Mistral, the neuron selection method might need to be entirely different."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KtHtSbRmS267"
      },
      "source": [
        "## Obtain & test the pruned model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "id": "NIUnFU5R3n42",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "5df1c68a-4740-42f5-a34d-f79bc2dafc82"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Pruning layers: 100%|██████████| 16/16 [00:06<00:00,  2.41it/s]\n"
          ]
        }
      ],
      "source": [
        "# Prune 10% of neurons from MLP layers using MAW method\n",
        "pruned_model, stats = prune_model(\n",
        "    model=model,\n",
        "    pruning_type=\"MLP_GLU\",\n",
        "    neuron_selection_method=\"MAW\",\n",
        "    pruning_percentage=20,\n",
        "    show_progress=True,\n",
        "    return_stats=True\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "tdJUkfWI3qMM",
        "outputId": "10fc16d7-0c83-4554-ee1d-e24368d21fc2"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Original parameters: 1,235,814,400\n",
            "Pruned parameters: 1,074,792,448\n",
            "Reduction: 161,021,952 parameters (13.03%)\n",
            "Expansion rate: 320.01953125%\n"
          ]
        }
      ],
      "source": [
        "# Print pruning statistics\n",
        "print(f\"Original parameters: {stats['original_parameters']:,}\")\n",
        "print(f\"Pruned parameters: {stats['pruned_parameters']:,}\")\n",
        "print(f\"Reduction: {stats['reduction']:,} parameters ({stats['percentage_reduction']:.2f}%)\")\n",
        "print(f\"Expansion rate: {stats['expansion_rate']:,}%\")\n"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# prompt: call the generate_response for the pruned_model\n",
        "\n",
        "# Assuming 'pruned_model' is defined as in the provided code.\n",
        "# Assuming 'get_output' function is also defined as in the provided code.\n",
        "\n",
        "prompt = \"Paris is the capital of\"\n",
        "generated = get_output(prompt, model=pruned_model) # Pass the pruned model to the get_output function\n",
        "print(f\"Generated text (pruned model): {generated}\")\n"
      ],
      "metadata": {
        "id": "9_Zc5VM86KTQ",
        "outputId": "b1e1b858-6913-46ee-97ea-6c8acfabea49",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Generated text (pruned model): Paris is the capital of France. It is also one of the most beautiful cities in the world. There is so much to see and do in Paris that it is impossible to cover it all in one day. However, there are some things you\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JGzXMQrVTULv"
      },
      "source": [
        "The result is slightly different from what the original model produced, but it’s still a fairly accurate response.\n",
        "\n",
        "In contrast to the model created in notebook: [6_2_pruning_structured_llama3.2-1b_KO.ipynb](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6_2_pruning_structured_llama3.2-1b_KO.ipynb) where the pruned Llama model lost almost all its utility, the model in this notebook retains a good portion of its knowledge."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dDQrSrf-VCyI"
      },
      "source": [
        "Looking at the model’s new structure, we can see that the `gate_proj` and `up_proj` layers have had their `out_features` reduced to 6554 from 8192. Consequently, the `down_proj` layer has its `in_features` adjusted to match the new size."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "ATAiqZW30NYN",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "b0b4b451-8ef6-409e-ad84-9a6c8fbdc3fa"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "LlamaForCausalLM(\n",
            "  (model): LlamaModel(\n",
            "    (embed_tokens): Embedding(128256, 2048)\n",
            "    (layers): ModuleList(\n",
            "      (0-15): 16 x LlamaDecoderLayer(\n",
            "        (self_attn): LlamaAttention(\n",
            "          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)\n",
            "          (k_proj): Linear(in_features=2048, out_features=512, bias=False)\n",
            "          (v_proj): Linear(in_features=2048, out_features=512, bias=False)\n",
            "          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)\n",
            "        )\n",
            "        (mlp): LlamaMLP(\n",
            "          (gate_proj): Linear(in_features=2048, out_features=6554, bias=False)\n",
            "          (up_proj): Linear(in_features=2048, out_features=6554, bias=False)\n",
            "          (down_proj): Linear(in_features=6554, out_features=2048, bias=False)\n",
            "          (act_fn): SiLU()\n",
            "        )\n",
            "        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)\n",
            "        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)\n",
            "      )\n",
            "    )\n",
            "    (norm): LlamaRMSNorm((2048,), eps=1e-05)\n",
            "    (rotary_emb): LlamaRotaryEmbedding()\n",
            "  )\n",
            "  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)\n",
            ")\n"
          ]
        }
      ],
      "source": [
        "print(model)"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "from optipfair.evaluation.benchmarks import time_inference, compare_models_inference"
      ],
      "metadata": {
        "id": "S27hrDroU04D"
      },
      "execution_count": 13,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Compare inference speed\n",
        "comparison = compare_models_inference(\n",
        "    model,\n",
        "    pruned_model,\n",
        "    tokenizer,\n",
        "    prompts=[\"Paris is the capital of\", \"The speed of light is approximately\"],\n",
        "    max_new_tokens=50\n",
        ")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "WsFAAAsrU3dQ",
        "outputId": "c7121c20-7112-4837-ecb4-5ab4777d74b0"
      },
      "execution_count": 18,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n",
            "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
            "Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "comparison"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "G4ExKBrBVGW7",
        "outputId": "36b574fb-4c99-403c-a21f-f61d12e6db2b"
      },
      "execution_count": 19,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'avg_original_time': 1.1571248769760132,\n",
              " 'avg_pruned_time': 1.0934675693511964,\n",
              " 'avg_original_tokens_per_second': 43.21177194570876,\n",
              " 'avg_pruned_tokens_per_second': 45.807976554364025,\n",
              " 'speedup': 1.0582159996410205,\n",
              " 'tps_improvement_percent': 6.0080956918802775,\n",
              " 'num_prompts': 2}"
            ]
          },
          "metadata": {},
          "execution_count": 19
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Conclusion\n",
        "This notebook serves as a demonstration of how simple it can be to create your own optimized models using the [OptiPFair](https://github.com/peremartra/optipfair/tree/main) library.\n",
        "\n",
        "The result is a model that uses less memory and is faster at inference than the original, while still retaining most of its knowledge.\n",
        "\n",
        "Of course, many more tests and a wider range of rankings are needed to fully understand the model’s performance. The evaluation module of the [OptiPFair](https://github.com/peremartra/optipfair/tree/main) library is still under development, so in the future it will be possible to apply more types of pruning and evaluate the resulting models with better benchmarks."
      ],
      "metadata": {
        "id": "2lBSAKQAUqQW"
      }
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lW-biFT6U0zQ"
      },
      "source": [
        "##Authors Note.\n",
        "In addition to creating content like this notebook and offering it under the MIT license, I have also contributed to repositories such as those of Hugging Face and Google Gemini.\n",
        "\n",
        "I am especially proud of my book: <a href=\"https://amzn.to/4eanT1g\"><b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).\n",
        "\n",
        "You can find it on both <a href=\"https://amzn.to/4eanT1g\">Amazon</a> and <a href=\"https://link.springer.com/book/10.1007/979-8-8688-0515-8\">Springer</a>, where they often have good deals on the purchase price.\n",
        "\n",
        "If you take a look and end up purchasing it, keep in mind that you can reach out with any questions via the Discussions section of this same repository or on any of my social media channels. I’ll do my best to respond as quickly as possible."
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": [],
      "machine_shape": "hm",
      "include_colab_link": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}